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The non-coding transcriptome of the hyperthermophilic archaeon Pyrococcus abyssi is investigated using the RNA-seq 
technology. A dedicated computational pipeline analyzes RNA-seq reads and prior genome annotation to identify small 
RNAs, untranslated regions of mRNAs, and cis-encoded antisense transcripts. Unlike other archaea, such as Sulfolobus 
and Halobacteriales, P. abyssi produces few leaderless mRNA transcripts. Antisense transcription is widespread (215 
transcripts) and targets protein-coding genes that are less conserved than average genes. We identify at least three novel 
H/ACA-like guide RNAs among the newly characterized non-coding RNAs. Long 5' UTRs in mRNAs of ribosomal proteins 
and amino-acid biosynthesis genes strongly suggest the presence of cis-regulatory leaders in these mRNAs. We selected 
a high-interest subset of non-coding RNAs based on their strong promoters, high GC-content, phylogenetic conservation, 
or abundance. Some of the novel small RNAs and long 5' UTRs display high GC contents, suggesting unknown structural 
RNA functions. However, we were surprised to observe that most of the high-interest RNAs are AU-rich, which suggests 
an absence of stable secondary structure in the high-temperature environment of P. abyssi. Yet, these transcripts display 
other hallmarks of functionality, such as high expression or high conservation, which leads us to consider possible RNA 
functions that do not require extensive secondary structure. 



Introduction 

Widespread or "pervasive" transcription of genomic regions out- 
side protein-coding genes is now well established in a wide range 
of eukaryotic species. The case for pervasive transcription in bac- 
terial and archeal genomes is not as clear since these genomes are 
compact with short intergenic regions that are often part of tran- 
scribed operons. Yet, several recent studies used deep sequencing 
or tiling arrays to evaluate non-coding transcription in a variety 
of bacterial species and identified large amounts of small RNAs, 
antisense transcripts and UTR extensions of protein coding genes 
(reviewed in ref. 1). Although the term "pervasive transcription" 
is not often associated to non-eukaryotic organisms, it turns out 
that a large part of non-coding regions in bacteria are covered by 
regulatory RNAs or UTR extensions, and that many, if not all, 
bacterial coding genes produce antisense transcripts. 

Screens for non-coding RNAs (ncRNAs) in archaea are not 
as developed as in the other domains of life. Early attempts of 
archeal Rnomics involved cDNA cloning 2 and computational 
screens possibly combined to northern blot or PCR validation. 3 " 6 
Classes of RNAs identified by these archeal screens notably 
differed from bacterial RNAs and were dominated by modifi- 
cation guides H/ACA and C/D-box RNAs. These approaches 

Correspondence to: Daniel Gautheret; Email: daniel.gautheret@u-psud.fr 
Submitted: 05/05/13; Revised: 06/25/13; Accepted: 06/27/13 
http://dx.doi.org/! 0.41 61/rna.25567 



are progressively superseded by high-throughput sequencing 
technologies enabling deep sequencing of total or size-selected 
RNA. In addition to new H/ACA and C/D box RNAs, 7 " 10 deep 
sequencing identified a number of new CRISPR (a defense sys- 
tem present in bacteria and archaea) RNA loci 7,8,10 as well as wide- 
spread antisense transcription of coding and non-coding loci. 7,8 
Such "cis-encoded antisenses" are clearly a significant part of 
the non-coding transcriptome as they were visible already using 
low-throughput Rnomics." Other recently identified classes of 
archeal RNAs include circular RNAs, 9 split RNA genes, 10 and a 
bacterial-like trans-acting small RNA. 12 

Among archaeal species sampled by Rnomics studies lie a 
number of hyperthermophiles (growth temperatures higher than 
80°C): Pyrococcus abyssi, Nanoarchaeum equitans, Sulfolobus solfa- 
taricus (refs. 2-3, 5-6, and 10) and members of the Pyrobaculum 
genus. 7 Counter-intuitively, organisms living at high temperature 
do not necessarily have GC-rich genomes. 13 " 15 Indeed, topo- 
logical constraints on circular DNA molecules may enforce 
double strand formation up to over 100°C independently of 
GC-content. 16 However, structured RNAs, such as tRNAs 
and rRNAs, tend to have very high GC-content in hyperther- 
mophiles 13 possibly because they do not have such constraints. 
Computational biologists embraced this discrepancy to develop 
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Figure 1. Consensus promoter motifs. Each frame shows the best ranking sequence motif identified by a MEME search 45 performed on the 50 nt up- 
stream region of: (A) CDS annotations; (B) RNA-seq-based transcription units; (C) long 5' UTRs, (D) sRNAs, (E) asRNAs. Number of occurrences, MEME P 
value and number of sites are given for each motif. Motif coordinates are numbered from the first base of the RNA-seq transcription unit (B-D) or from 
the ATG start codon (B) and correspond to the dominant motif location. No dominant location was found for the asRNA motif (E). 



ncRNA detection programs based on local GC-enrichment in 
otherwise AT-rich genomes. 2,4 These algorithms were successful 
in identifying a number of structured RNAs but the question of 
non-coding RNA genes in hyperthermophiles has been set aside 
since then. Currently, it is not known whether the multiple non- 
coding RNAs identified by deep sequencing adopt secondary 
structures like rRNA and tRNA do. 

Here, we sequenced the complete transcriptome of P. abyssi 
grown at 92°C and reconstructed the set of coding and non- 
coding transcripts. Taking advantage of a higher coverage than 
in previous P. abyssi transcriptome screens, we confirmed a set 
of short or extended ORFs and estimated the number of novel 
non-coding elements in each major ncRNA category: indepen- 
dent transcripts, cis-encoded antisenses, and UTR extensions. 
We analyzed the properties of novel RNA candidates in each 
category and identified several new functional RNAs of the H/ 
AC A box class, and possibly of the cis-regulatory RNA class. 
Finally, we analyzed the G:C content and abundance of new 
ncRNA elements and showed how they differ from ncRNAs 
identified in previous experimental and computational screens. 
This analysis revealed that a large fraction of the newly identi- 
fied ncRNAs do not adopt stable secondary structures although 



they harbor other hallmarks of functional RNAs such as the 
presence of strong promoters, high expression levels, or phyloge- 
netic conservation. 

Results 

RNA deep sequencing and transcript classification. We col- 
lected RNA from a P. abyssi culture grown at 92 °C sampled 
at successive growth stages up to stationary phase and submit- 
ted RNA to directional RNA-seq library preparation. Illumina 
sequencing produced 51 million single reads of length 40 nt of 
which 5.6 million mapped to unique, non-rRNA loci. The P. 
abyssi genome is very dense and comprises a large number of 
genes that overlap at their UTRs. 17 This context makes it dif- 
ficult to distinguish new independent transcripts from the 5' 
and 3' extensions of previously annotated genes. We developed a 
protocol that clusters RNA-seq reads in a strand-specific fashion 
and detects overlaps with previous annotations in order to merge 
clusters into transcription units (TUs). Our basis for previous 
annotation was the NCBI annotation updated by that of Gao et 
al. 18 who identified 115 extra ORFs. TUs overlapping previous 
annotation were split into 5' UTR, 3' UTR, CDS, and operon 
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spacers. New TUs were analyzed for their coding potential and 
classified either as novel protein-coding genes (CDS) or inde- 
pendent ncRNAs, here called small RNAs (sRNA). Small RNAs 
overlapping TUs on the opposite strand were reclassified as cis- 
encoded antisense RNAs (asRNAs). 5' UTR were further classi- 
fied into "long 5' UTRs" when longer than 50 nt. 

Promoters and UTR regions. Consensus sequence motifs 
were extracted from the upstream region of annotated CDSs 
(Fig. 1A) and the region upstream of RNA-seq-based TUs 
(Fig. IB). The Pyrococcus consensus promoter is composed of 
two boxes often referred to in archaea as BRE element and TATA 
box. 19 Here, the TATA box sequence is TTT(A/T) (T/A) AA, 
similar to that of Methanococcus vannielii (TTTATAATA), 20 
and Sulfolobus solfataricus (TTTTAAA). 8 RNA-seq-based tran- 
scription start sites (TSS) are located in average 20 nt down- 
stream of the 3' end of the TATA box (Fig. IB). The fraction 
of transcription units featuring both a TATA and a BRE box 
(68%) is higher than the fraction of annotated CDS in this case 
(53%), indicating that our annotation of TUs improves previ- 
ous CDS annotation. 

Few leaderless transcripts are present in P. abyssi. Previous 
observations in Sulfolobus* and Halobacteriales 1 ^ established 
a majority of leaderless transcripts in these species (> 69% in 
Sulfolobus, > 90% in halobacteriales), when the methanogen 
Methanosarcina mazei revealed a majority of long 5' UTR (up 
to 500 nt), 22 unveiling a variety of situations in Archaea. The 
median size of the P. abyssi 5' UTRs is 37 nt (Fig. SI). We pre- 
dict 292 5' UTRs with a size under 20 nt (Fig. S2A), but most 
can be attributed to imperfect RNA-seq coverage rather than to 
a true lack of leader sequence. Indeed, an analysis of sequences 
upstream of the predicted TSS reveals an RBS motif in 180 of 
the 207 UTRs with a size of 6-16 nt, leaving only 27 transcripts 
that may lack a proper leader (Fig. S2B). Independently of the 
predicted UTR size, the distance between the 3' end of the TATA 
box and the RBS is constant (Fig. S2B) and suggests an actual 
size of these short 5' UTRs of about 20 nt. 

The promoters of long 5' UTRs and independent non-coding 
RNAs (sRNAs) display consensus motifs similar to that of other 
promoters (Fig. 1C and D). However, the fraction of sRNAs with 
a major promoter (27/107 = 25%) is significantly lower than for 
protein coding genes (68%), suggesting a relatively lower accu- 
racy of TSS definition in our sRNA annotation. Antisense RNAs 
(asRNAs) are not preceded by canonical promoters. Instead, 
a small proportion (36/215) are flanked by a motif (Fig. IE), 
which does not correspond to a known regulatory sequence or 
its reverse complement. The fact the majority of asRNAs are not 
flanked by any visible promoter sequence suggests a large fraction 
of these RNAs result either from leaky transcription or process- 
ing of longer transcripts. 

Our sequencing coverage was sufficient to identify 3' UTRs 
of size 10 nt or more for 446 genes. Considering all genes with 
a detectable 3' UTR, the median size of predicted 3' UTRs in 
Pyrococcus is 37 nt (Fig. SI). However, note that single-end 
RNA-seq protocols like the one we used here do not systemati- 
cally sequence the 3' side of cDNA library fragments, thus actual 
3' UTRs are probably longer. 



Table 1. RNA abundance by class 



RNA class 


Number of 
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Total number of 
reads 


Median RPKM 


rRNA 
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32727463 


14,818 


tRNA 


46 
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264 


Long 5' UTR 


98 


172014 


185 


sRNA 


107 


84797 


63 


Antisense RNA 


215 


13253 


43 


CDS 


1,893 


5261317 


164 



Overall portrait of non-coding elements. Table 1 presents 
the amount of RNA-seq reads associated to each class of ncRNA 
element. The most populated classes of ncRNAs are asRNAs 
(215), followed by sRNAs (107) and long 5' UTRs (98). A sub- 
stantial fraction of the sRNAs and long 5' UTRs (respectively, 
37% and 45%) were already identified in previous studies and/or 
annotated as encoding ncRNA elements in databases. However, 
the vast majority of asRNAs (97%) were previously unreported 
(Fig. 2A). Non-coding RNAs are generally less conserved than 
coding genes at the nucleotide sequence level but more than 
intergenic region. Between 47—57% of the non-coding elements 
are specific to P. abyssi, compared with 28% for coding sequences 
and 76% for intergenic region (Fig. 2B). 

We measured RPKM (reads per kb per million) as an 
approximation of RNA abundance (Table 1). Since we used in 
our library preparation an exonuclease degrading 5 -monophos- 
phorylated RNA species (Materials and Methods), we expect 
a depletion of rRNAs and tRNAs, as well as of any processed 
RNAs, including some sRNAs or asRNAs. Keeping this limi- 
tation in mind, the most abundant non-coding elements after 
rRNAs and tRNAs are long 5' UTRs (median: 185 RPKM), 
followed by sRNAs (63 RPKM) and asRNAs (43 RPKM). 
The relatively high abundance of long 5' UTRs only reflects 
that of coding transcripts in general, since the median RPKM 
in coding regions is 164 RPKM (Table 1). Analysis of the 
new transcription units revealed 26 new or extended protein 
coding sequences: nine in sRNAs and in 18 in long 5' UTRs 
(Table SI). 

The coverage of intergenic regions by RNA-seq reads enabled 
us to combine a number of annotated genes into multicistronic 
transcription units (Fig. S2). Overall, 1,466 out of 1,944 anno- 
tated genes (75%) are part of multi-cistronic transcripts and 
are thus likely expressed as operons. Conversely, 444 out of 
888 extended transcription units involve two or more CDSs or 
ncRNA genes. The longest polycistronic transcripts (28 CDS 
and 20 CDS) are ones encoding ribosomal proteins. 

Widespread cis-antisense transcription includes known 
functional RNAs. Although we used a strict definition of anti- 
sense, which we require to be fully embedded in the opposite 
strand of a transcription unit, our asRNA list still includes four 
C/D box guide RNAs. Indeed, one of the longest and most 
expressed asRNAs (PabOH5, size 277 nt: Table SI) matches a 
C/D box RNA (RFAM family RF01130) and sits right against 
the center of a five-gene operon (Fig. S3). Therefore, in the 
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Figure 2. Characteristics of ncRNAs identified by RNA-seq. (A) Numbers of known RNAs vs. novel RNAs. Sources for known ncRNAs: NCBI features, 
RFAM, 23 and studies from Klein et al. 3 and Phok et al. 6 classes are ranked in order of decreasing confidence starting from experimentally validated RNAs 
(from left to right on the bar plot). When the known ncRNA is from P. abyssi, a minimum overlap of 25 nt with the known RNA and the new candidate 
is required to assign the candidate to a class. When the known ncRNA is from another Archaeal species, a BLASTN sequence conservation with a mini- 
mum BLASTN bit score of 42 is required to assign the candidate to a class. When a candidate appears in several classes, it is counted only in the class 
with highest confidence. (B) Conservation of ncRNA classes. At each taxonomic level, the histogram shows the fraction of elements conserved up to, 
and not deeper than, this taxonomic level (see Materials and Methods). Elements shown include the three ncRNA classes (107 sRNAs, 98 long 5' UTRs, 
215 asRNAs), CDS, antisense-associated CDS (CDS-asRNA), and intergenic region. To avoid conservation bias due to size differences, conservation for 
CDS, CDS-asRNA, and intergenic region were obtained on randomly sampled fragments from CDS, CDS-asRNA, and intergenic regions with the same 
size distribution as ncRNA, asRNA, and ncRNA, respectively. 



compact P. abyssi genome, functional trans-acting ncRNAs can 
be produced from fully antisense transcripts. 

Conservation analysis does not support a predominance of 
trans-acting asRNAs. Indeed, such asRNAs would sustain a dual 
selection pressure, first due to their trans-acting role and second 
due to the coding sequence on the opposite strand. However, 
asRNAs are significantly less conserved than average protein 
coding genes (see "CDS," Fig. 2B), which argues against wide- 
spread trans-acting functions. The conservation level of asRNAs 
is similar to that of other ncRNAs (5' UTRs and sRNAs). This 
relatively low conservation also applies to other regions of genes 
harboring asRNA (see CDS-asRNA in Fig. 2B). Moreover, genes 
harboring asRNA are less expressed than average CDS (Fig. S7). 



Altogether, this shows that asRNAs generally occur in a class of 
genes that are less conserved and expressed than average protein- 
coding genes. 

Overall, the antisense category is that with fewer previ- 
ous annotation (Fig. 2A), which is unsurprising since previous 
screens mostly relied on sequence conservation or GC-content 
and systematically excluded coding sequences and their anti- 
sense strands. The number of actual functional RNAs among the 
215 identified antisense elements remains to be determined. We 
selected 71 "high interest" candidate functional asRNAs based 
on their high abundance, large size, strong promoter, or high 
GC-content (noted PabOlOO to PabO170 in Table SI). It should 
be noted that the only four asRNAs with a known function rank 
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Figure 3. (A) Distribution of 107 sRNAsas a function of GC%and RPKM. Unknown RNAs are dark blue, known RNAs are colored as in legend. (B) GC-con- 
tents of transcript classes. Sequences were extracted following annotations (CDS, tRNA, rRNA, sno-like RNA) or RNA-seq analysis (sRNA and long 5' UTR). 



among the top six asRNAs ranked by abundance. This sug- 
gests RPKM is a good criterion for selecting putative functional 
asRNAs. 

New sRNA elements are mostly unstructured. Our anno- 
tation workflow identified 107 putative sRNAs (Table SI). We 
requalified nine of them as ORFs. Eighteen others were previ- 
ously annotated in RFAM 23 or in prior publications as CRISPR, 
C/D box, or H/ACA box RNAs, while 14 were identified in pre- 
vious screens 3,6 but had no assigned function. 

Expectedly, sRNAs with high GC-contents, high conser- 
vation, or high expression were more likely to be reported in 
previous studies. Among 27 sRNAs meeting at least two out of 
these three criteria, only five are novel to this study (PabOl-5, 
Table SI). Of 67 sRNAs identified here for the first time, 47 
can be considered as "high interest" based on any of the criteria: 
high expression, high conservation, strong promoter, or high- 
GC (noted PabOOl to Pab047, Table SI). Of note, 16 sRNAs 
have unusually long sizes of over 200 nt, which includes four 
CRISPR RNAs and nine high interest candidates that were 
not previously reported. Furthermore, two of the high interest 
sRNAs (Pab09, PabOlO) are arranged in reciprocal antisense 
orientation (Fig. S4), reminiscent of a bacterial toxin/anti- 
toxin system. These mutual antisense sRNAs were not classi- 
fied as asRNAs by our annotation pipeline as they overlap only 
partially. 



A high GC-content is a hallmark of structured RNA in 
hyperthermophiles. 13 Prior RNA detection screens based on 
GC-content mostly identified RNAs in the range 50-75% GC. 3 
Here a threshold of 50% GC would retain only 15% of sRNAs. 
Nine of the novel RNAs are above this threshold and may thus 
constitute new instances of archaeal structured RNAs. Most of 
these new high-GC sRNAs are low-expression transcripts, which 
explains why they were not previously detected. Figure 3A shows 
a plot of abundances vs. GC-contents for all sRNAs, identified 
by their functional status. This confirms the general tendency for 
novel sRNAs to harbor low GC% and low abundance. Notable 
exceptions are Pab05 expressed at 293 RPKM and Pab017 at 
155 RPKM. Three novel high-GC RNAs (Pab05, Pab043, 
Pab02) were computationally predicted by Phok et al. 6 but could 
not be confirmed experimentally. 

If we compare the GC content distribution of new sRNAs with 
that of other transcript classes, the AU-richness of new sRNAs 
appears even more strikingly (Fig. 3B). Novel sRNAs peak at 
around 40-45% GC, against 70% for tRNAs and rRNAs, and 
55% for C/D and H/ACA RNAs. Even coding regions have a 
GC content peak (50%) that is significantly higher than that of 
new sRNAs. This extreme AU-richness is a strong indication that 
most of the novel sRNAs identified in this study are generally 
unstructured. Importantly, however, AU-richness does not imply 
absence of function: for instance, CRISPR RNAs are generally 
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AU-rich (Fig. 2A). A strong indication of functionality in the 
P. abyssi AU-rich sRNAs is the fact that 41% of sRNAs identified 
by RNA-seq are conserved in thermococcales or deeper in the 
evolutionary tree (Fig. 2B), which is much higher than the 15% 
GC-rich fraction. Indeed, 16 AU-rich sRNAs (< 45% GC) are 
deeply conserved and 41 AU-rich sRNAs are considered "high 
quality" by the above criteria (Table SI). 

New long 5' UTR elements suggest cis-regulated mRNAs. 
Our workflow identified 98 5' UTRs of 50 nt or more. Among 
these, 18 most likely include new ORFs that constitute exten- 
sions of previous annotated CDSs (Table SI). As for sRNAs, 
most of the "low hanging fruits" in the long 5' UTR category 
had been picked up by previous screens: among 26 long 5' UTRs 
meeting at least two criteria among high GC, high conserva- 
tion, and high expression, only seven are novel to this study or 
were only known as computational predictions (PabO60-66). 
Thirteen of the long 5' UTRs overlap RFAM 23 entries for C/D 
or H/ACA guide RNAs. Location in 5' UTR is a frequent fea- 
ture of archaeal guide RNAs, which are known to often overlap 
coding regions. 24 " 27 In such cases, it is difficult to infer from the 
RNA-seq data alone whether the ncRNA is transcribed indepen- 
dently or processed from the mRNA leader. Both situations have 
been reported. 8 

At the time we submit this manuscript, there is still no experi- 
mentally demonstrated riboswitch or other cis-regulatory RNA 
in archaea. Here, we can however pinpoint several strong candi- 
dates: the crcb RNA known as the fluoride riboswitch 28 is similar 
to one of our highly expressed long 5' UTRs. This ncRNA was 
predicted (and not confirmed) by the Klein et al. 3 and Phok et al. 6 
screens and the presence of the riboswitch inferred in archaea by 
comparative genomics. 29 Here, we confirm its expression in the 
form of a 176 nt leader. Another long leader that was validated in 
previous screens 3,6 is proposed to be a pre-Ql riboswitch by Phok 
et al. 6 (under the name RNA28) and confirmed in our analysis 
(Table SI). Besides previously proposed riboswitches and guide 
RNAs, we confirm the expression of Ssca, an archeal RNA of 
unknown function already identified by Klein 3 and Phok 6 and 
described as RFAM family RF00063. 

Several long 5' UTRs are located upstream of genes that are 
known to be cis-regulated in bacteria, such as ribosomal pro- 
tein genes (rpsl5p, rpll5e, rpl21e, rps6e, rpl7ae), and amino 
acid metabolism pathway genes (aminotransferase, N-terminal 
protease, leu-, val-, and trp-aminoacyl tRNA synthetase). The 
presence of regulatory leader sequences upstream of ribosomal 
protein genes was suggested in S. solfataricus? Here, we observe 
that Archaeal cis-regulatory leaders may also involve amino acid 
biosynthesis genes. 

Few of the novel long 5' UTRs are GC-rich (Fig. 3B; Table SI), 
which argues against extensive secondary structure such as found 
for instance in the T-boxes of bacterial aminoacyl tRNA syn- 
thetases. Most of the high-GC leaders are already annotated as 
C/D or H/ACA box RNAs. Only five new leaders of unknown 
function have a GC content above 50%. One of them, Pab066, 
is 71% GC and is located upstream of an uncharacterized CDS. 

Novel H/ACA RNAs. H/ACA RNAs are well-characterized 
small non-coding RNAs present both in eukaryotes and archaeas. 



They are involved in RNA-guided modifications where a specific 
U residue, often in rRNA, is converted into a pseudouridine, a 
common base modification in the ribosome. Other RNA targets 
like tRNAs also exhibit specific pseudo-uridylations, some of 
which can be directed by the same RNA guided machinery. 30 In 
Pyrococcus genomes, seven H/ACA sRNAs have been identified 
by an in silico approach and validated experimentally. 5 Those H/ 
ACA RNAs are annotated in the Pyrococcus genomes and can 
also be retrieved from their RFAM identifier (Pabl9, Pab21- 
snoR9: RF00065, Pabl60, Pab35-HgcF: RF00058, Pab40- 
HgcG: RF00064, Pab91, Pabl05-HgcE: RF00060). They were 
initially identified using a structure descriptor corresponding to 
a helix-loop-helix motif where an internal loop of variable length 
(from five to 11 residues on both 5' and 3' ends) connects two 
stems: a basal stem and an apical stem including at least seven 
and five base pairs, respectively. Another essential feature in the 
descriptor is the presence of a terminal loop including an embed- 
ded K-turn or K-loop sub-motif which is required to recruit one 
of the proteins (L7Ae) necessary for the H/ACA sRNP assembly. 

A new family of H/ACA-related motifs was recently discov- 
ered in Pyrobaculum genomes. 7 These non-canonical H/ACA 
sRNAs differ from their canonical relatives found in P. abyssi 
in that the first stem is absent. However, the truncated motifs 
still maintain the second stem and a K-turn motif, two structural 
features which are sufficient to preserve the function of these 
H/ACA-like motifs as RNA guides. 7 The 5' and 3' ends of the 
H/ACA-like motifs remain unpaired and may be used as anti- 
sense elements for the RNA-RNA interaction with its target(s). 
Although the most abundant motifs are H/ACA like motifs in 
Pyrobaculum, both H/ACA and H/ACA-like motifs coexist and 
are functional. Nevertheless, the H/ACA motifs in Pyrobaculum 
are slightly different from those in Pyrococcus with a basal stem 
shortened by one or two base pairs (five or six base pairs instead 
of seven or eight) that may also include bulged residues. 

We compared candidates ncRNAs from our RNA-seq analy- 
sis with the results of a genome wide search for novel H/ACA 
RNAs containing H/ACA or H/ACA-like motifs similar to those 
found in Pyrobaculum (see Materials and Methods). Four new 
H/ACA RNA candidates were identified (shown with their pos- 
sible targets in Table S2). Figure 4 presents the predicted sec- 
ondary structure of the three most reliable candidates (PabOl, 
Pab048, and Pab078), based on the criteria mentioned above 
(high GC content, high conservation, high abundance) in a 2D 
structure representation including the H/ACA(like) motif and its 
possible targets. These findings suggest H/ACA-like RNAs are 
not specific to crenarchaea, and may also occur in euryarchaea. 

Discussion 

In hyperthermophile bacteria and archaea, structured RNA such 
as tRNA, rRNA, or modification guide RNA have seldom been 
found with a GC content below 50% (this study and refs. 13 and 
14 — some C/D box RNAs with very short hairpin structures may 
constitute rare exceptions) . It is tempting to generalize this obser- 
vation and propose that no low-GC transcript can ever fold into 
a stable secondary structure at extreme high temperature. Indeed, 
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Figure 4. New H/ACA gene candidates (Pab01, Pab048, Pab078) corresponding to H/ACA or H/ACA-like motifs identified in RNA-seq ncRNAs. The 2D 
structure representations were generated using VARNA - version 3.9. 49 The guide RNA candidates include the annotation of the K-turn or K-loop motif 
(green). The RNA targets (red) are represented as paired to the guide RNA in the internal loop of the H/ACA motif or in the free ends of the H/ACA-like 
motif. The coding strand is indicated for all targets by a (+) or (-) symbol. Targets located on the non-coding strand with respect to the CDS, are noted 
"strand (-)." The CRISPR 2 (+) target follows the nomenclature adopted by Phok et al. 6 All targets shown are expressed at some degree in the genome. 



factors other than a high GC content can contribute to RNA 
stabilization in hyperthermophiles. These include nucleotide 
modifications, 31 the presence of polyamines, 32 and high salt con- 
centration, which was shown to stabilize DNA 33 and RNA. 34 Most 
likely, however, RNA molecules that require secondary structure 
to operate at high temperature should be, just like rRNAs and 
tRNAs, under selection for GC-rich helices in addition to any 
extra stabilizing factor. Consequently, we can reasonably assume 
an absence of secondary structure over most of the AU-rich sRNAs 
and mRNA leaders. However, an absence of secondary structure 



does not mean these RNAs are not functional. Indeed there are 
several strong indications of functionality in a large fraction of 
the AU-rich ncRNAs, including high abundance, the presence 
of strong promoters and, importantly, sequence conservation. 
Then what can these functions be? Small RNAs exert regulatory 
functions by targeting mRNAs and triggering RNA degradation 
or translation block. Cases have recently emerged showing such 
regulatory RNA activity exists in archaea. 12,35 Of course, RNA- 
RNA interactions would require at least a short GC-rich stretch 
for efficient targeting at high temperature, but this may not affect 
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the overall GC-content. Alternatively, AU-rich RNAs may act by 
recruiting proteins through sequence-based interactions or short 
local structures. Again, a short local GC-rich hairpin may form 
in an otherwise AU-rich RNA. This type of behavior would be 
reminiscent of the function of certain long non-coding RNAs 
(IncRNAs) in eukaryotes, which act mainly by recruiting proteins 
although they lack extensive sequence or structure signals. 36 ' 37 

Pervasive transcription was first described in eukaryotic 
organisms and later extended to bacterial cells. It is now clear 
that archeal genomes also produce transcripts over most of their 
intergenic regions and antisense to a large number of genes. Klein 
et al. 3 observed that structured ncRNAs were rarer in the archaea 
P. furiosus and M. janaschii than in bacterial genomes. They 
hypothesized that the use of ncRNA could be selected against in 
high temperature environments, or that organisms with a reduced 
gene set such as many archaea did not require sophisticated RNA- 
based regulation. However, later transcriptome analysis of the 
reduced bacterial genome of Mycoplasma pneumoniae revealed 
over a hundred RNAs with potential regulatory functions, most 
of them antisense with respect to protein-coding genes. 38 Hence 
compact genomes may retain extensive RNA-based regulation. 
It is important though to dissociate RNA functionality from the 
existence of stable secondary structures. The emerging classes 
of eukaryotic regulatory RNAs such as endogenous siRNAs, 
IncRNAs, or XUTs 39 all seem to bind their RNA or protein targets 
without help from extensive secondary structure. Archaea have 
been invaluable models for studying various aspects of eukaryotic 
cell functions. It might turn out that RNA-based regulation is 
another key feature that may benefit from the archaeal model. 

Materials and Methods 

Strain growth. Pyrococcus abyssi GE5 was grown anaerobically 
in ASW-YT rich medium, containing artificial seawater, yeast 
extract and tryptone 40 at 92 °C. After two overnight transfers, 
cells were diluted to 2 x 10 6 cells/mL. After an initial growth of 
3 h, samples (10 mL) were taken every 1.5 h until cells reach the 
stationary phase (11 h of growth). Samples were quickly cooled 
down in liquid nitrogen, spinned, and the pellets were frozen at 
-80°C. 

RNA extraction and sequencing. Total RNAs were extracted 
using the TRIzol method (Invitrogen), following the manu- 
facturer's instructions. RNA quality was monitored by agarose 
gel electrophoresis and concentration was measured using the 
NanoDrop 1000 Spectrophotometer (Thermo Fisher Scientific 
Inc.). A pool of equal amounts of each sample was checked for 
integrity by 2100 Bioanalyzer (Agilent Technologies Inc.) and 
treated to enrich in primary transcripts using a 5 '-phosphate- 
dependent exonuclease (Terminator, Epicenter), following the 
manufacturer's instruction except for the amount of enzyme used 
in the reaction was 1 U per microgram of enriched RNAs and 
incubation was done at 30 °C for 1.5 h. Total RNA composition 
before and after treatment are shown in Figure S8. RNA samples 
were subsequently purified and concentrated using the RNA 
Clean-Up and Concentration Kit (Norgen Biotek Corp) before 
library preparation as described elsewhere in detail. 41 Briefly, the 



strand-specific RNA-seq template library was prepared starting 
from a pool of total samples (50 ng) following the directional 
mRNA-seq library preparation protocol provided by Illumina 
Inc. The library was sequenced (40 bp single-read) using an 
Illumina GA-IIx sequencer. 

ncRNA classification into long 5' UTR, sRNA, asRNA. 
Oriented RNA-seq reads were mapped to the P. abyssi GE5 
genome sequence using the Bowtie program. 42 Uniquely map- 
ping reads were then processed using the DetR'prok ncRNA 
annotation pipeline, 43 available in the Galaxy Tool Shed (http:// 
toolshed.g2.bx.psu.edu) under category "Sequence Analysis." 
Briefly, this workflow clusters mapped reads, compares clusters 
with previous annotation and generates extended annotations 
including 5' and 3' non-coding regions. Furthermore, extended 
annotations merge tandem protein coding, tRNA, and rRNA 
genes separated by intergenic regions shorter than 25 nt or fully 
covered by RNA-seq reads. The workflow then classifies non- 
coding fragments of extended annotations into long 5' UTRs, 
sRNAs, and asRNAs (Fig. S5). Antisense RNAs must be fully 
included in an extended annotation on the opposite strand. We 
used the P. abyssi genome annotation corrected by Gao et al. 18 

The ORF search was conducted using the EasyGene soft- 
ware 44 that uses a HMM model to score putative coding 
sequences based on codon statistics. The model is trained over 
a set of genes that is automatically extracted based on similarity 
with known genes. 

Naming convention: we provide a name in the form PabOx 
(P. abyssi Orsay x) to any novel "high interest" RNA that was 
not experimentally confirmed in prior publications. We define 
as high interest any RNA meeting at least one criterion among 
the following. For sRNAs: high-GC, high conservation, high- 
RPKM, strong promoter; for long 5' UTRs: high-GC, high con- 
servation, high-RPKM; for asRNA: high-GC, strong promoter, 
long size. We define high-GC as > 50% GC; high conservation 
as conserved in five species or more (see below definition of con- 
servation), high-RPKM as more than twice the median RPKM 
value for this type of element and long size as over 200 nt. 

Annotated ncRNAs and extended_annotations are available 
in Supplemental Material as GFF (General Feature Format) files 
providing locations, names, and qualitative information for each 
transcript (Fig. SI). 

Promoter motif detection. We used the MEME web server 45 
using default options except for "search given strand only" to 
detect motifs in the upstream sequences of CDS, transcription 
units, long 5' UTR, sRNA, and asRNA (Fig. 1). We extracted 
50 nt sequences upstream of each annotation. Upstream sequences 
were excluded when another CDS was encountered before 50 nt. 
Sequences shorter than 10 nt were excluded from analysis. 

The list of sRNAs with both BRE and TATA box motifs 
(Table SI) was produced as the combination of two searches: (1) 
the results of a MEME search from the 50 nt upstream sequences 
of all sRNAs and (2) the presence of a BRE-TATA box motif in 
the sRNA upstream region (50 nt) given by a FIMO 46 search, 
performed over the whole genome using the BRE-TATA box 
motif defined from 50 nt upstream regions of expressed genes 
(default option with P value threshold set to 10e-3). 
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Conservation analysis. All identified ncRNA sequences 
were submitted to a BLASTN (version 2.2.15 with param- 
eters: -W 7 -r 2 -q -3 -G 5 -E 2 -e 10) search on a database of 
76 archaeal genomes, selected based both on their "complete" 
status given by the Genomes Online Database (http://www. 
genomesonline.org) in March 2012, and their availability in 
the NCBI genome repository. An empirical filter to remove 
BLASTN alignments with a bit score below 42 (correspond- 
ing to an expected value of 0.06) was applied and all retained 
hits were inspected to assess the extent of conservation for each 
query Inspection was based on the taxonomy lineage offered 
by the ORGANISM record of the genome GenBank file, where 
"Archaea" is the deepest level and species strain the uppermost 
level. Each BLASTN hit corresponds to a species and, thus, a lin- 
eage. For each BLASTN hit, the uppermost common level of the 
two lineages was retained. The conservation level of a sequence 
element (Fig. 2B; Table SI) was then given by the uppermost 
common level among all lineages defined by the BLASTN hits 
of this element. For example, an RNA is found in two spe- 
cies: Pyrococcus abyssi and Methanosarcina acetivorans. The taxo- 
nomic description for P. furiosus is "Archaea; Euryarchaeota; 
Thermococci; Thermococcales; Thermococcaceae; Pyrococcus; 
abyssi" and that for M. acetivorans is "Archaea; Euryarchaeota; 
Methanomicrobia; Methanosarcinales; Methanosarcinaceae; 
Methanosarcina; acetivorans." Then the uppermost common 
level between these two species is "Euryarchaeota." 

H/ACA-like motif search. A standard descriptor-based 
search was performed using RNAMotif 47 to identify canonical 
or non-canonical H/ACA motifs with structural features simi- 
lar to those found in Pyrobaculum 7 except that we also allowed 
the possible substitution of the K-turn motif by a K-loop. A 
series of filters were then used to post-process the results, which 
initially included around 3300 hits: around 100 H/ACA or H/ 
ACA-like motifs with a K-turn, -3100 for H/ACA or H/ACA- 
like motifs with a K-loop. The first filter was expression-based, 
retaining only hits in expressed regions from the RNA-seq data, 
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